• Why does model performance drop when using time-based train-test splits?

    I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance once I move from training to validation. I’m sharing a simplified version of the problem and code below.

    The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

    Here’s a simplified version of the code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    
    # sample data
    df = pd.read_csv("data.csv")
    df = df.sort_values("event_time")
    
    X = df.drop(columns=["target"])
    y = df["target"]
    
    # time-based split
    split_index = int(len(df) * 0.8)
    X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
    y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]
    
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    
    preds = model.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, preds))
    

    With a random split, the AUC was around 0.82.
    With the time-based split, it drops to around 0.61.

    I’m trying to understand:

    • Is this performance gap a common sign of data leakage in the original setup?

    • Are tree-based models like Random Forests particularly sensitive to temporal shifts?

    • What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem?

    • Would you approach validation differently for time-dependent data like this?
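
    One common way to probe the drift question in the list above is "adversarial validation": train a classifier to tell early rows from late rows. The sketch below is only an illustration on synthetic data; the helper name `drift_auc` and the generated arrays are assumptions, not part of the original code. An AUC near 0.5 suggests the two periods look alike in feature space, while an AUC near 1.0 suggests the feature distribution has shifted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def drift_auc(X_early, X_late, seed=0):
    """Cross-validated AUC of a model that predicts which period a row came from.

    Hypothetical helper: ~0.5 means the periods are hard to tell apart,
    values close to 1.0 hint at a shift in the feature distribution.
    """
    X_all = np.vstack([X_early, X_late])
    # Label 0 = early period, 1 = late period.
    period = np.r_[np.zeros(len(X_early), dtype=int), np.ones(len(X_late), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X_all, period, cv=5, scoring="roc_auc").mean()


# Synthetic stand-ins for the first 80% and last 20% of a time-ordered dataset.
rng = np.random.default_rng(0)
X_early = rng.normal(0.0, 1.0, size=(400, 3))
X_late_same = rng.normal(0.0, 1.0, size=(100, 3))     # same distribution
X_late_shifted = rng.normal(2.0, 1.0, size=(100, 3))  # deliberately shifted

auc_same = drift_auc(X_early, X_late_same)
auc_shifted = drift_auc(X_early, X_late_shifted)
print("AUC, no shift:", round(auc_same, 3))
print("AUC, shifted: ", round(auc_shifted, 3))
```

    On real data, `X_early` and `X_late` would be the numeric feature matrices on either side of the time split. A high score here points toward feature drift; it does not by itself rule out leakage or a change in the feature-target relationship.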

    Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.
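
    For the validation-strategy question, one option is walk-forward validation with scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and tests on the block that follows. This is a minimal sketch on synthetic data, assuming the rows are already sorted by time; the helper name `walk_forward_auc` and the generated features are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit


def walk_forward_auc(X, y, n_splits=5, seed=42):
    """Return one AUC per fold; each fold trains on the past and tests on the next block."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return scores


# Synthetic, time-ordered stand-in data (assumption: rows are sorted by event time).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

fold_scores = walk_forward_auc(X, y)
print("AUC per fold:", [round(s, 3) for s in fold_scores])
```

    Looking at the per-fold scores rather than a single number can show whether performance degrades steadily over time or collapses in one particular period.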

  • Is the Future of Data Science Moving Away From Modeling and Toward Problem Framing?

    Data science as a discipline is shifting faster than most people realize. A decade ago, the core skill set revolved around building models, tuning hyperparameters, crafting feature pipelines, and selecting algorithms. But with the rise of AutoML, pretrained foundation models, vector databases, and agentic AI systems, much of the “technical heavy lifting” is becoming automated or abstracted away.

    Today, the competitive advantage is less about who can write the best model from scratch and more about who can frame the right problem, define meaningful metrics, interpret model outputs responsibly, design data loops, and understand the business impact of predictions. Even the most complex models (LLMs, multimodal architectures, time-series forecasters) can now be deployed with pre-built frameworks or API calls.

    This shift raises an important question about the future of the field:
    If modeling becomes commoditized, does the true value of a data scientist lie in strategic thinking rather than technical implementation?

  • Why does everyone seem to be choosing data science these days?

    I keep seeing a lot of people jumping into data science, especially those without a tech background. I'm curious why this field is getting so much attention compared to others like cloud, web dev, or cybersec. Is it the salary hype? The job flexibility? Or just that it sounds cooler than traditional dev roles? I’m personally torn between data science and going deeper into backend/web dev, so I wanted to hear from folks who’ve already picked a path. What made you choose data over other domains, and was it worth it?

  • How to sync data from multiple sources without writing custom scripts?

    Our team is struggling with integrating data from various sources like Salesforce, Google Analytics, and internal databases. We want to avoid writing custom scripts for each. Is there a tool that simplifies this process?

  • Looking for guidance on a data science tech stack

    Hi everyone,

    I’m currently an undergraduate student in Data Science, actively working toward becoming a data scientist. So far, I’ve built a foundation with basic machine learning models using libraries like Pandas, NumPy, Matplotlib, Scikit-learn, and some PyTorch. I’ve also explored LLMs by working with pre-trained models through Hugging Face and LangChain. Lately, I’ve been diving into more advanced ML and deep learning concepts, setting up CI/CD pipelines, and learning backend development for ML using FastAPI and Flask.

    Despite experimenting with this wide range of tools and technologies, I still find myself unclear about what companies actually expect from data scientists—both at junior and senior levels. What tech stack should I focus on? Which trends and skills are truly valued in the industry?

    As a student, it’s hard to get a clear answer on this. Could someone with experience in the field help clarify what companies are really looking for in data scientists today?

    Thanks in advance!
